Assignment1 Summary


Chosen expression dataset from GEO (GSE139242)
contact_address contact_city contact_country contact_department contact_email contact_institute
Kirkeveien 166, Laboratoriebygget OSLO Norway Department of Medical Genetics OUS

Initial Defined Groups
  1. cell type (CD4, CD8)
  2. tissues( thymus, bloodinfant, bloodadult)
  3. patients( or Individuals numbered from 1 to 5)
celltype tissue patient
CD4_thymus_1 CD4 thymus 1
CD4_thymus_2 CD4 thymus 2
CD4_thymus_3 CD4 thymus 3
CD4_thymus_4 CD4 thymus 4
CD4_bloodinfant_1 CD4 bloodinfant 1
CD4_bloodinfant_2 CD4 bloodinfant 2
CD4_bloodinfant_3 CD4 bloodinfant 3
CD4_bloodinfant_4 CD4 bloodinfant 4
CD4_bloodadult CD4 bloodadult NA
CD8_thymus_1 CD8 thymus 1
CD8_thymus_2 CD8 thymus 2
CD8_thymus_3 CD8 thymus 3
CD8_thymus_4 CD8 thymus 4
CD8_thymus_5 CD8 thymus 5
CD8_bloodinfant_1 CD8 bloodinfant 1
CD8_bloodinfant_2 CD8 bloodinfant 2
CD8_bloodinfant_3 CD8 bloodinfant 3
CD8_bloodadult CD8 bloodadult NA

Compare the distribution of pre and post normalization


MDS plots based on different groups


BCV Plot


MeanVar Plot

## 
## The downloaded binary packages are in
##  /var/folders/4w/5ghh1f910_zfbnfv6h5rst9w0000gn/T//Rtmp6IEueW/downloaded_packages

Assignment 2 Summary

This is a short glaze of what we start with:

The 15 samples are kept, The cell-type category is composed of CD4 and CD8, and there are 8 CD4 samples and 7 CD8 samples in the dataset. 2 samples of “bloodadult” tissue type was removed.

knitr::kable(normalized_count_data[c(1:5),c(1:5)], type="html")
ensembl_gene_id hgnc_symbol CD4_thymus_1 CD4_thymus_2 CD4_thymus_3
ENSG00000000003 ENSG00000000003 TSPAN6 2.750699 4.395522 1.618822
ENSG00000000419 ENSG00000000419 DPM1 37.402365 31.998891 36.502047
ENSG00000000457 ENSG00000000457 SCYL3 18.826214 20.721749 23.585623
ENSG00000000460 ENSG00000000460 C1orf112 34.115816 23.810148 31.870986
ENSG00000000938 ENSG00000000938 FGR 3.268688 4.664636 2.780002

Part1. Differential Gene Expression

Ranked genes with Top p-value

This is the resulting table with ranked genes with Top p-value

knitr::kable(output_hits2[1:5,],type="html")
ensembl_gene_id hgnc_symbol logFC AveExpr t P.Value adj.P.Val B
9800 ENSG00000172116 CD8B 278.623826 140.008071 21.51238 0 1.00e-07 4.884591
7408 ENSG00000153563 CD8A 486.339551 236.540282 15.09044 0 4.50e-06 4.271212
6043 ENSG00000138835 RGS3 41.451709 26.976475 13.07518 0 1.59e-05 3.913343
2600 ENSG00000106028 SSBP1 -46.072426 85.107571 -12.97505 0 1.59e-05 3.892086
13955 ENSG00000237506 RPSAP15 -2.103834 1.160512 -11.91344 0 3.79e-05 3.641291

Short summary of # of genes that pass the thresholds

Gene pass the threshold p-value < 0.05 from Model #2 (cell type + tissue)
6894
Genes pass correction from Model #2 (cell type + tissue)
4512
Gene pass the threshold p-value < 0.01 from Model #1 (Cell type only)
3960
Genes pass correction with adj.P-value < 0.01 (Cell type only)
1651
Using QLF method within EdgeR package

Calculate differential expression using the Quasi liklihood model with coef=“cell-type”

logFC logCPM F PValue FDR
ENSG00000172116 5.110761 7.478149 526.0195 0 0e+00
ENSG00000103569 6.253015 3.031245 471.4154 0 0e+00
ENSG00000256039 7.246702 6.229868 390.9360 0 0e+00
ENSG00000153563 6.602412 8.250343 346.2413 0 0e+00
ENSG00000138835 2.783237 5.138695 341.3913 0 0e+00
ENSG00000203747 6.606881 5.622688 308.5814 0 0e+00
ENSG00000181036 4.843257 3.666595 288.4766 0 0e+00
ENSG00000128294 1.969806 6.064547 261.2659 0 1e-07
ENSG00000172543 4.174503 6.433384 242.9130 0 1e-07
ENSG00000186407 6.631176 3.924736 231.4641 0 1e-07
x
BH
x
samples$celltypeCD8
x
glm

Short summary of genes that pass the threshold

Total genes
16205
Gene pass the threshold p-value < 0.05 using edgeR
5546
Genes pass correction with QLF model with p-value < 0.05
3775
Gene pass the threshold p-value < 0.01 using edgeR
3658
Genes pass correction with QLF model with p-value < 0.01
2079
Genes p-value < 0.05 and abs(logFC) >= 1
2006

Estimate the dispersion of my data

Using the MeanVar Plot to explore the mean-variance relationship. Only the trended dispersion is used under the quasilikelihood pipeline.

Calculate dispersion to observe variance deviates from the mean. Biological Coefficient of Variation (BCV) and dispersion is a measure of how much variation there in the samples.


3. Highlight genes of interest with MA Plot or a Volcano plot

Here is a summary of the number of Up regulated and down regulated genes:

Up Regulated Genes
3524
Down Regulated Genes
2022

HeatMaps of top hit genes using Limma Model#2 (accounting for cell-type and tissue variability)

  • With P.value < 0.05
  • Columns ordered by cell types (CD4 or CD8)

  • With adj.P.Val < 0.01
  • Columns ordered by Tissues (Thymus or Bloodinfant)

Generally we can observe the same cell type with same tissue type samples are definitely clustered in groups.

By reading the first heatmap we can observe that with in the Cell type groups, different tissue type shows differences in there gene expression pattern, with the second heatmap, within the same tissue, the gene of different cell type also express differently.


Thresholded over-representation analysis (ORA)

Which method did you choose and why?

I used g:profiler for the thresholded gene set enrichment analysis.

What annotation data did you use and why? What version of the annotation are you using?

I chose GO:BP, KEGG and REAC as my annotation data.

GO biological process: releases/2019-07-01 KEGG: KEGG FTP Release 2019-09-30 Reactome: ensembl classes: 2019-10-2

How many genesets were returned with what thresholds? The threshold is 0.05.

Upregulated genes: 1353 go bioligical process terms, 110 KEGG terms and 103 REAC terms

Downregulated genes: 657 go bioligical process terms,22 KEGG terms and 376 REAC terms

Diff genes: 538 go bioligical process terms, 31 KEGG terms and 63 REAC terms

Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?

#####Up-Regulated Genes

#####Down-Regulated Genes

Up + Down Genes (all differentially expressed genes)

A3: Non-thresholded Gene Set Enrichment Analysis (GSEA)

Compute ranks and generate .rnk file

Download Geneset

Download the “Human_GOBP_AllPathways_no_GO_iea_February_01_2020_symbol.gmt” from bader lab link: “http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/

GSEA parameters

Required field Gene sets database:Human_GOBP_AllPathways_no_GO_iea_February_01_2020_symbol.gmt Number of permutations:1000 collapse/Remap to gene symbols:No_Collapse

Basic fields Enrichment statistic: weighted Max size: 200 Min size: 15

GSEA Result

Enrichment in phenotype: na + 2310 / 3211 gene sets are upregulated in phenotype na_pos + 1158 gene sets are significant at FDR < 25% + 522 gene sets are significantly enriched at nominal pvalue < 1% + 855 gene sets are significantly enriched at nominal pvalue < 5%

Enrichment in phenotype: na

  • 901 / 3211 gene sets are upregulated in phenotype na_neg
  • 566 gene sets are significantly enriched at FDR < 25%
  • 399 gene sets are significantly enriched at nominal pvalue < 1%
  • 487 gene sets are significantly enriched at nominal pvalue < 5%

The top hit for upregulated gene is IMMUNOREGULATORY INTERACTIONS BETWEEN A LYMPHOID AND A NON-LYMPHOID CELL%REACTOME%R-HSA-198933.5

ES = 0.68 NES = 3.38 FDR q-val= 0.000 size = 74 genes

The top hit for downregulated gene is TRANSLATION%REACTOME DATABASE ID RELEASE 71%72766

ES = -0.62 NES = -4.55 FDR q-val = 0.000 size = 129 genes

Compare to the previous result from g-profile

Create Enrichment map to visulize GSEA result

Adding annotation using autoannotate app

The top largest annotated clusters are:

  • hh proteasome degradation
  • decay heterocycle catabolic
  • electron transport metabolic
  • mitochondrial translational translation
  • export nucleus rna
  • neutrophil migration chemotaxis
  • secretion regulation protein
  • cell adhesion activation
  • ribonucleoprotein biogenesis subunit
  • migration chemotaxis positive
  • natural killer immunity
  • adenylate cyclase coupled
  • rna processing splicing

Interpretation

Do the enrichment results support conclusions or mechanism discussed in the original paper? How do these results differ from the results you got from Assignment #2 thresholded methods?

Yes the enrichment result seems matching with the mechanism discussed in the paper such that the up regulated genes are falls into the categories of “IMMUNOREGULATORY INTERACTIONS BETWEEN A LYMPHOID AND A NON-LYMPHOID CELL”